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Abstract. This paper deals with the parallel raytracing part of virtual- 
reality system PROLAND, developed at the home institution of authors. 
It describes an actual implementation of the raytracing part and intro- 
duces a Coloured Petri Nets model of the implementation. The model 
is used for an evaluation of the implementation by means of simulation- 
based performance analysis and also forms the basis for future improve- 
ments of its parallelization strategy. 

1 Introduction 

During the past several years, high-performance and feature-rich PC graph- 
ics interfaces have become available at low cost. This development enables 
us to build clusters of high-performance graphics PCs at reasonable cost. 
Then photorealistic rendering methods like raytracing or radiosity can be com- 
puted faster and inexpensively. Raytracing is one of computer graphics tech- 
niques used to produce accurate images of photorealistic quality from complex 
three-dimensional scenes described and stored in some computer-readable form 
[1, 2, 9]. It is based on a simulation of real-world optical processes. One great 
disadvantage of such techniques is that they are computationally very expen- 
sive and require massive amounts of floating point operations [4, 6]. Parallel 
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raytracing takes advantage of parallel computing, cluster computing in partic- 
ular, to speed up image rendering, since this technique is inherently parallel. 
The use of clusters [8] for computationally intensive simulations and applica- 
tions has lead to the development of interface standards such as the MPI and 
OpenPBS. 

This paper provides insight into various means of decomposing the raytrac- 
ing process (based on the free raytracer Pov-ray [12]) and describes a parallel 
raytracing process management simulation. We decided to use Coloured Petri 
Nets (CPNs) and CPN tools software for the simulation and a performance 
analysis based on it. The CPNs and CPN tools were chosen because of good 
simulation-based performance analysis support, familiar formalism and the 
fact that the tools are available for free. The other reason was that in the case 
of some more complicated raytracing process management design we can spec- 
ify its analytical model using low-level Petri nets and use analytical "tools" 
of Petri nets, such as invariants, and others, including our own results in the 
field of formal methods, for a verification of the model. After that, the analyt- 
ical model can be transformed to the CPN model suitable for a performance 
analysis. 



2 Raytracing and its computation model 

In nature, light sources emit rays of light, which travel through space and 
interact with objects and environment, by which they are absorbed, reflected, 
or refracted. These rays are then received by our eyes and form a picture. 

Raytracing produces images by simulating these processes, with one signif- 
icant modification. Emitting rays from light sources and tracking them would 
be very time-consuming and inefficient, because only a small fraction ends up 
in the eye/camera, the rest is irrelevant. So instead of this, raytracing works by 
casting rays from camera through image plane (for each pixel of final image) 
into the scene and tracking these rays. It computes the intersection of the ray 
with the first surface it collides with, examines the material properties (cast- 
ing additional rays for refraction/reflection if necessary) and incoming light 
from light sources in the scene (by casting additional rays from intersection to 
each source) and then computes the colour of the pixel in the final image [12]. 
Raytracing belongs to a set of problems that utilize parallel computing very 
well, since it is computationally expensive and can be easily decomposed. The 
two main factors influencing the design and performance of parallel raytracing 
systems, are the computation model and the load-balancing mechanism [4]. 
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There are two principal methods of decomposing a raytracing computation: 
demand-driven and data-driven (or data-parallel) , and there are research activ- 
ities focused on developing a hybrid model trying to combine the best features 
of the two models [6] . The final product of raytracer by demand-driven parallel 
raytracing is an image ofmxn pixels, and since each pixel is computed inde- 
pendently, the most obvious way of decomposition is to divide the image into 
p parts, where p is a number of processors available and each processor would 
compute m x n/p pixels and ideally, the computation would be p times faster. 
This approach is called demand-driven parallel raytracing. A number of jobs 
are created each containing different subset of image pixels and these jobs are 
assigned to processors. Input scene is copied to local memory of each proces- 
sor. Processors render their parts, return computed pixels, get another job if 
there is any, and in the end the final image is composed from these parts [1]. 

Main benefits of this approach are easy decomposition and implementation, 
simple job distribution and control and the fact that a general raytracing 
algorithm remains unchanged and scales well. The main disadvantage is that 
the input scene has to be copied to local memory of each processor, which 
poses a problem if the scene is very large. 

Data-driven parallel raytracing approach, also called data-parallel raytra- 
cing, splits the input scene into a number of sections (tiles) and assigns these 
sections to processors [1, 2]. Each processor is responsible for all computations 
associated with objects in this particular section, no matter where the ray 
comes from. Only rays passing through the processor's section are traced. If a 
ray spawned at one processor needs data from another processor, it is trans- 
ferred to that processor. The way the scene is divided into section determines 
the efficiency of parallel computation. Determining the number of rays that 
will pass through a section of the scene in order to estimate the sections re- 
quiring the most processing is one of the hardest problems to overcome. Using 
the cost function can be helpful. Main benefit of this approach is that the 
input scene doesn't have to be copied entirely to each processor, but it is split 
into sections, so even very large scenes can be processed relatively easy. Main 
disadvantage is that this approach doesn't scale very well with growing scene 
complexity and cluster size, because of task communication overhead and ray 
transfers [6]. 

2.1 Parallelization implementation 

For parallel implementation a cluster-based computing system is used. Cluster- 
based rendering [8] in general can be described as the use of a set of computers 
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Figure 1: Basic structure for parallel raytracing implementation 

connected via a network for rendering purposes, ranging from distributed non- 
photorealistic volume rendering over raytracing and radiosity-based rendering 
to interactive rendering using application programming interfaces like OpenGL 
or DirectX. 

For the raytracing itself, a freeware program Pov-ray is used [7] , and atop of 
Pov-ray, a front-end performing parallel decomposition and job control is built. 
Pov-ray is able to render only a selected portion of the picture, so it's very 
convenient for naive parallelization. Implementation is limited by Pov-ray's 
capabilities: 

• only contiguous rectangular section of image can be rendered in one job, 

• each job requires parsing the scene and initial computations all again, 

• each Pov-ray job requires whole scene and 

• program should be able to handle failures of individual nodes. 

Because of these facts, the program implements demand-driven computa- 
tion model. For a load balancing, static or dynamic load balancing by tiling 
decomposition seems to be the best choice. Implementation uses the Message 
Passing Interface and SPMD program model. 

Fig. 1 shows the basic used structure. It isn't a typical master/slave scenario, 
here all nodes are equal, with the exception of the root node, which also 
controls the whole operation, allocates jobs and interacts with the user. That 
allows us to utilize massive parallelism. User puts in the scene to be rendered 
and additional control information. Root (master) node partitions the final 
image plane into sections (tiles) and allocates them to nodes. On each node, 
the process forks and executes Pov-ray to render its part of the image. When 
finished, it returns the rendered pixels and waits for another job, if required. At 
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the end, the root node puts the whole image together and returns it to the user. 
It is a simple algorithm. We need to develop a better strategy for distribution of 
a scene section of rendered image. Better node, time and memory management 
is necessary. Because the development of an improved strategy using "real" 
hardware and software is expensive and very time-consuming, we decided to 
use formal CPN models and an appropriate simulation on them instead. 

3 Coloured Petri Nets 

Coloured Petri Nets (CPNs) [3] is a discrete-event formal modelling language, 
able to express properties such as non-determinism and concurrency. It com- 
bines a well-known Petri nets formalism with an individuality of tokens to 
enhance its modelling power and the CPN ML functional programming lan- 
guage to handle data manipulation and decision procedures. 

A CPN model has a form of digraph with two types of vertices: places 
(ellipses) and transitions (rectangles). 

Each place holds tokens of some type. In CPNs types are called colour sets. 
Colour sets range from simple ones as UNIT (with the only value "()"), INT, 
BOOL, to compound sets such as List, Record or Product. An example of 
user-defined colour sets (record and timed list) can be seen in Fig. 5. Tokens in 
places define state of CPN, which is called marking. Markings are represented 
as multisets, i.e. marking "VI + +7'2" of the place pi from Fig. 2 means that 
pi holds one token of value 1 and seven tokens of value 2. If there is only one 
token in a place, we can omit a number of tokens (for example we can write 
"4" instead of "V4"). 

Transitions of CPN represent events that change the state (marking) of the 
net. A transition t can be executed, or fired, when there are enough tokens of 
corresponding value in places from which there is an arc to t. These tokens are 
removed when t is fired and new tokens are generated in places to which there 
is an arc from t. A number and values of removed and created tokens are 
determined by corresponding guarding predicates (guards), associated with 
transitions, and arc expressions. A small example in Fig. 2 illustrates the 
behaviour of CPN. The net in Fig. 2 has 3 places (pi, p2, p4) of colour set 
INT and one place of colour set UNIT. Initially the net is in the (initial) 
marking with "VI + +7'2" in pi and 4 tokens of value 1 in p2 (Fig. 2(a)). An 
actual marking is shown in boxes left to the places. The transition t with the 
guard "x > y" can be fired only for x = 2 and y = 1 now. The net after the 
firing is shown in Fig. 2(b). 
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Figure 2: CPN fragment before (a) and after (b) the firing of the transition t 



3.1 Performance analysis with CPNs 

To broaden the scope of CPNs usage, facilities allowing simulation-based per- 
formance analysis have been added to both the CPNs language and its sup- 
porting tool, called CPN tools [11]. These facilities include time concept for 
CPNs (timed CPNs), random distribution functions for CPN ML and data 
collecting and simulation control monitors for CPN tools. 

A (model) time in CPNs and CPN tools is represented as an integer value. 
There are also values, called time stamps, associated with tokens, representing 
a minimal time when the tokens are ready for firing. Colour sets of such tokens 
must be timed. In our models we distinguish timed colour sets by a postfix 
"trn" or "Tm" . The model time doesn't change while there is some transition 
that can be fired. When there is no transition to fire, the time advances to the 
nearest time value with some transitions to fire. All time-related information 
in markings, expressions and guards is prefixed by "@". A small example of 
timed CPN can be seen in Fig. 3(a). Both places can hold timed integers. In 
the initial marking we have one token of value 1 and timestamp in tpl . So, 
ttl can fire in time = 0. After firing of ttl one token of value 1 and time 
stamp = firing time +10 appears in tp2 (Fig. 3(b)). In addition, the model 
time advances to 1 0, because there is nothing to be fired in time = and the 
token in tp2 will not be available (ready) before time = 10. 

Because of space limitations we described CPNs very briefly here. An inter- 
ested reader can find more information in [3, 10] or at [11]. 
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(a) (b) 
Figure 3: Timed CPN fragment before (a) and after (b) the firing of ttl 

4 CPN model for parallel raytracing 

A timed CPN specification of our current implementation of distributed ray- 
tracing, as described in section 2.1, can be seen in Fig. 4. The time in our 
model is measured in milliseconds. 

In the initial marking there are tokens in places newScene, nodesNo, 
freeNodes, scStartTime and preparedTLles. The place newScene holds one 
token with randomly chosen value from interval 1 0000 to 70000 (computed by 
the function discrete). This value characterizes a complexity of a scene to be 
raytraced and its range is based on our practical experience. In general the 
scene complexity depends on its size, number of objects, objects complexity 
(number of polygons), objects material (opacity, mirrors, ...), illumination 
model and camera parameters. The place nodesNo holds a token with num- 
ber of computers in our cluster (8 computers) and freeNodes has one token 
for each node, where its value designates a type of the node. Albeit all the 
nodes are equal we have to distinguish between the client nodes {type 2) and 
the master node (type 1) that also manages the whole process. So, only about 
70% of master performance is used for raytracing. The prep aredTiles holds 
one token with empty list of tiles, because the scene is not divided yet. 

Only the transition sendScene can be fired in the initial marking, in time = 
0. Its firing represents sending of the whole scene to each client node. Sending of 
the scene is a sequential process and its duration is computed by an expression 



(nNo - 1 ) * rnNormal_int (20000, 10000) 
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Figure 4: Timed CPN model of distributed raytracing in 8 computers cluster 



where nNo is a number of all nodes and the function rnNormaLint(m,v) re- 
turns a value from the exponential random distribution with mean m and 
variance v. The firing also divides the scene into the list of tiles with (almost) 
constant width and height and saves the starting time point of scene raytrac- 
ing as a token in scStartTime. To store information about a tile the colour 
set TILE is used (Fig. 5), where fields wdt and hgt store tile dimensions, com- 
plxt stores tile complexity, cSuc determines whether tile raytracing will be 
successful and ndType is a type of node where the tile will be raytraced. The 
list is generated by the function getTileList and is stored as a single token in 
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colset TILE = record wdt : INT * hgt:INT * complxt:INT 

* cSuc: BOOL* ndType:INT; 
colset TILElistTm = list TILE timed; 

Figure 5: Declarations of some colour sets 

fun tileCopml (0, remCmpl) = | 

tileCopml (1 , remCmpl) = remCmpl | 
tileCopml (remTiles , 0) = | 
tileCopml (remTiles , remCmpl) = 

let 

val tCmp = (Real . f romlnt remCmpl / 

Real.fromlnt remTiles) 
val cmpl=rnNormalr_int(tCmp*0.8,tCmp *0.7) 

in 

if (remCmpl>cmpl) then cmpl else remCmpl 
end; 

Figure 6: Definition of tileCopml function 

the place preparedTiles. The function also distributes the scene complexity 
randomly among the tiles. This random distribution is computed by the func- 
tion tileCopml (Fig. 6), that is called within getTileList. Its first argument, 
remTiles, is a number of remaining tites to be added to the generated list and 
remCmpl is a complexity to be distributed among remaining tiles. 

A firing of the transition selectTile means an assignment of raytracing job to 
a free node nt. The selected tile t is removed from the list in preparedTiles and 
moves to prep Tile. In addition, the function setSuc-nTp assigns a node type 
(field ndType) to t and randomly chooses a raytracing job success (cSuc) for t. 
We assume that 90% of all jobs on client nodes will be successful and that the 
master node never fails. If the field cSuc of t is true (i.e. "#cSuct = true" in 
CPN ML), then a firing of sucRtrStart moves t to raytr Tiles. The timestamp 
of t is also increased by raytracing time and a communication delay. The 
raytracing time is computed by raytrTm from all fields of t except cSuc. The 
communication delay, computed by commTime, is taken from an exponential 
random distribution and represents the time needed to contact the master 
node, which can be busy performing other tasks, and to send the raytraced 
tile to it. After raytracing sendRtrTile moves the tile into already computed 
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ones (computedTiles) and frees the node used. 

The path of a fallen one begins with a firing of unsucRtrStart, which moves 
t to unsrRcrytr Tiles. The delay computed by failcheckTm is a time needed 
to detect that a given node failed and is not responding. The response of 
nodes is checked regularly in our implementation, so the delay computed is a 
randomly chosen multiple of checking period (chckPer) with some upper limit. 
Next, a firing of returuTile moves t back to the list in prep are dTiles and the 
failed node to invalidNodes, where it waits for recovery. We optimistically 
suppose that each node recovers within one day. Finally the node is returned 
to freeNodes by a firing of recoverNode. 

After successful processing of all tiles the scene can be finalized and the tran- 
sition completeScene fired. Its firing removes all tokens from computedTiles 
and generates a new one in newScene, so a raytracing process can start over 
again. There is a data collecting monitor, which saves information about ray- 
tracing duration and number of used nodes into the text file for further pro- 
cessing when completeScene is fired. 

5 Simulation experiments 

To evaluate our implementation of parallel raytracing under various conditions 
we carried out several simulation experiments on the CPN model created. Here 
we present results concerning the relation between number of nodes in the 
cluster and duration of scene raytracing. In these experiments we fixed the 
scene complexity to 36500 and used a big scene with 30000 x 22500 pixels and 
a small one with 10000 x 7500 pixels. Tile dimensions were 1000 x 750 pixels 
in both cases. We considered two scenarios: 

• an ideal scenario, where all nodes are equal (i.e. the master can use all 
of its performance for raytracing) and no computation fails and 

• a real scenario, with conditions as described in Section 4. 

Number of nodes ranged from 2 to 25. The results obtained are depicted in 
Fig. 7. As a reference we also included the raytracing duration when only one 
node is used. The values used in graphs are averages from multiple simulation 
runs. Of course, in the real scenario, the raytracing time is longer and the 
curve is not so "smooth" as in the ideal scenario. This is because in the real 
scenario some nodes can be invalid and raytracing time can be equal or even 
longer as in the cluster with fewer nodes. The results also reveal that it is not 
effective to use more than ten nodes in our current implementation of parallel 
raytracing environment. 
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Figure 7: Number of nodes to raytracing duration ratio for 1 0000 x 7500 pixels 
scene (left) and 30000 x 22500 scene (right) 



6 Conclusion 



In this paper we presented our current implementation of distributed ray- 
tracing in a cluster environment. We also introduced a CPN model of the 
implementation, which has been used to evaluate a performance of the im- 
plementation and will be used as a basis for the development of an improved 
parallelization strategy. Our intention is to evaluate possible improvements on 
corresponding CPN models and choose the best with respect to the perfor- 
mance analysis. In the case of some more complicated strategy we can specify 
an analytical model using low-level Petri nets first and use analytical facilities 
of Petri nets, such as invariants and others, including our own theoretical and 
practical results [5], for a verification of the model. 
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